AITopics | full data

A Fast and Accurate Estimator for Large Scale Linear Model via Data Averaging

Neural Information Processing Systems | Feb-13-2026, 20:20:55 GMT

The asymptotic behavior of the proposed estimation procedure is studied. Our theoretical results show that the proposed method can achieve a faster convergence rate than the optimal convergence rate for sampling methods.

algorithm, artificial intelligence, machine learning, (18 more...)

Country: Asia > China > Beijing > Beijing (0.05)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Variational Bayesian Unlearning

Neural Information Processing Systems | Dec-24-2025, 12:07:48 GMT

This paper studies the problem of approximately unlearning a Bayesian model from a small subset of the training data to be erased. We frame this problem as one of minimizing the Kullback-Leibler divergence between the approximate posterior belief of model parameters after directly unlearning from erased data vs. the exact posterior belief from retraining with remaining data. Using the variational inference (VI) framework, we show that it is equivalent to minimizing an evidence upper bound which trades off between fully unlearning from erased data vs. not entirely forgetting the posterior belief given the full data (i.e., including the remaining data); the latter prevents catastrophic unlearning that can render the model useless. In model training with VI, only an approximate (instead of exact) posterior belief given the full data can be obtained, which makes unlearning even more challenging. We propose two novel tricks to tackle this challenge. We empirically demonstrate our unlearning methods on Bayesian models such as sparse Gaussian process and logistic regression using synthetic and real-world datasets.
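To make the unlearning target concrete, here is a minimal sketch of the exact (non-variational) case, assuming a scalar Gaussian mean model with known noise variance. When erased and remaining data are conditionally independent given the parameters, Bayes' rule gives p(θ | remaining) ∝ p(θ | full) / p(erased | θ), so in a conjugate model unlearning amounts to subtracting the erased data's contribution from the posterior's natural parameters. The class `GaussianBelief` and the functions `learn` and `unlearn` are hypothetical names chosen for illustration; this is not the paper's variational method, which targets the harder case where only an approximate posterior is available.

```python
from dataclasses import dataclass


@dataclass
class GaussianBelief:
    """Gaussian belief over a scalar parameter, stored in natural parameters."""
    precision: float       # 1 / variance
    precision_mean: float  # mean / variance

    @property
    def mean(self) -> float:
        return self.precision_mean / self.precision

    @property
    def variance(self) -> float:
        return 1.0 / self.precision


def learn(belief: GaussianBelief, data: list, noise_var: float) -> GaussianBelief:
    """Conjugate update: multiply the belief by the Gaussian likelihood of `data`."""
    return GaussianBelief(
        belief.precision + len(data) / noise_var,
        belief.precision_mean + sum(data) / noise_var,
    )


def unlearn(belief: GaussianBelief, erased: list, noise_var: float) -> GaussianBelief:
    """Exact unlearning: divide the belief by the likelihood of the erased data."""
    return GaussianBelief(
        belief.precision - len(erased) / noise_var,
        belief.precision_mean - sum(erased) / noise_var,
    )


# Illustrative values only (assumptions, not data from the paper).
prior = GaussianBelief(precision=1.0, precision_mean=0.0)
remaining = [1.2, 0.8, 1.1, 0.9]
erased = [5.0, 4.5]
noise_var = 0.5

full = learn(prior, remaining + erased, noise_var)
unlearned = unlearn(full, erased, noise_var)   # unlearn directly from the full posterior
retrained = learn(prior, remaining, noise_var)  # retrain from scratch on remaining data
```

In this conjugate setting `unlearned` and `retrained` coincide up to floating-point error, which is the ideal the abstract's KL objective is measured against.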

artificial intelligence, machine learning, proceedings, (5 more...)

Genre: Research Report (0.84)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

6de668dab370194fa304a08be5aacd85-Paper-Conference.pdf

Neural Information Processing Systems | Oct-8-2025, 21:10:23 GMT

algorithm, artificial intelligence, machine learning, (18 more...)

Country: Asia > China > Beijing > Beijing (0.05)

Genre: Research Report (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Coresets for Archetypal Analysis

Sebastian Mair, Ulf Brefeld

Neural Information Processing Systems | Oct-3-2025, 02:13:28 GMT

Neural Information Processing Systems http://nips.cc/

archetypal analysis, artificial intelligence, machine learning, (15 more...)

Country: North America (0.28)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.73)

ments [ ] The experimental analysis of Bachem et al. (2018) shows that the lightweight-coreset performs very similar

Neural Information Processing Systems | Oct-3-2025, 02:13:13 GMT

We thank all reviewers for their careful reading and their valuable comments. As seen in the figure on the right, the performance of Lucic et al. (2016) [..] We now included this baseline in the paper. R1: The dimension of B is stated wrongly [..] Thank you for pointing [..] "In contrast to k-means, we assume that the mean .." is not clear to me. Thank you for raising this issue. Reviewer 3 also pointed this out.

archetype, artificial intelligence, machine learning, (10 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.37)

Median Selection Subset Aggregation for Parallel Inference

Xiangyu Wang, Peichao Peng, David B. Dunson

Neural Information Processing Systems | Feb-9-2025, 22:18:56 GMT

For massive data sets, efficient computation commonly relies on distributed algorithms that store and process subsets of the data on different machines, minimizing communication costs. Our focus is on regression and classification problems involving many features. A variety of distributed algorithms have been proposed in this context, but challenges arise in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. We propose a MEdian Selection Subset AGgregation Estimator (message) algorithm, which attempts to solve these problems. The algorithm applies feature selection in parallel for each subset using Lasso or another method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in both sample and feature size, and has theoretical guarantees. In particular, we show model selection consistency and coefficient estimation efficiency. Extensive experiments show excellent performance in variable selection, estimation, prediction, and computation time relative to usual competitors.
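The four steps named in the abstract (per-subset selection, median inclusion, per-subset estimation, averaging) can be sketched as follows. This is a minimal single-machine illustration, not the authors' implementation: the subsets are processed in a loop rather than in parallel, and `select_features` is a hypothetical stand-in that keeps the `k` features most correlated with the response instead of running Lasso (the abstract allows "Lasso or another method"). The names `message_sketch`, `n_subsets`, and `k`, and the simulated data, are assumptions.

```python
import numpy as np


def select_features(X, y, k):
    """Hypothetical stand-in selector: keep the k features most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) + 1e-12)
    mask = np.zeros(X.shape[1], dtype=bool)
    mask[np.argsort(score)[-k:]] = True
    return mask


def message_sketch(X, y, n_subsets, k):
    """Sequential sketch of median selection followed by coefficient averaging."""
    parts = np.array_split(np.arange(X.shape[0]), n_subsets)

    # Step 1: feature selection on each subset (would run in parallel).
    inclusion = np.array([select_features(X[idx], y[idx], k) for idx in parts])

    # Step 2: 'median' inclusion index, i.e. a majority vote per feature.
    # Use an odd n_subsets so the median of 0/1 votes is never 0.5.
    selected = np.median(inclusion.astype(float), axis=0) > 0.5

    # Step 3: least-squares fit on the selected features, per subset.
    coefs = [
        np.linalg.lstsq(X[idx][:, selected], y[idx], rcond=None)[0]
        for idx in parts
    ]

    # Step 4: average the per-subset estimates.
    beta = np.zeros(X.shape[1])
    beta[selected] = np.mean(coefs, axis=0)
    return selected, beta


# Simulated data for illustration only: 3 active features out of 20.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
true_beta = np.zeros(20)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + 0.5 * rng.normal(size=600)

selected, beta = message_sketch(X, y, n_subsets=5, k=3)
```

Only the boolean inclusion vectors and the short coefficient vectors would need to cross machine boundaries, which is where the low communication cost comes from.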

algorithm, artificial intelligence, machine learning, (11 more...)

Country: North America > United States > Pennsylvania (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Predictive Coresets

Flores, Bernardo

arXiv.org Artificial Intelligence | Feb-8-2025

We propose a construction of coresets based on a predictive view of Bayesian posterior inference (Fong et al., 2024; Fortini and Petrone, 2012). The main attraction of the approach is its model-agnostic nature: the method is valid with any inference model and independent of the specific inference goals, making it highly adaptable for a wide range of applications. Such adaptability is particularly valuable in the context of large-scale datasets, now commonplace in fields like genomics and astronomy. While this explosion of data offers incredible opportunities for discoveries, it also brings significant computational challenges. Tasks that were once straightforward, such as evaluating likelihoods several times, have become increasingly difficult, making traditional data processing methods impractical. These obstacles have frequently pushed practitioners toward simpler statistical models that might not capture the full complexity of the data, disregarding the expressiveness and flexibility that rich hierarchical and nonparametric models can offer.

artificial intelligence, bayesian inference, machine learning, (17 more...)

2502.05725

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.65)

Industry: Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.47)

Variational Bayesian Unlearning

Neural Information Processing Systems | Oct-11-2024, 05:01:48 GMT

This paper studies the problem of approximately unlearning a Bayesian model from a small subset of the training data to be erased. We frame this problem as one of minimizing the Kullback-Leibler divergence between the approximate posterior belief of model parameters after directly unlearning from erased data vs. the exact posterior belief from retraining with remaining data. Using the variational inference (VI) framework, we show that it is equivalent to minimizing an evidence upper bound which trades off between fully unlearning from erased data vs. not entirely forgetting the posterior belief given the full data (i.e., including the remaining data); the latter prevents catastrophic unlearning that can render the model useless. In model training with VI, only an approximate (instead of exact) posterior belief given the full data can be obtained, which makes unlearning even more challenging. We propose two novel tricks to tackle this challenge.

bayesian model, posterior belief, variational bayesian unlearning, (1 more...)

Genre: Research Report (0.49)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

A model-free subdata selection method for classification

Singh, Rakhi

arXiv.org Machine Learning | Apr-29-2024

Subdata selection is a study of methods that select a small representative sample of the big data, the analysis of which is fast and statistically efficient. The existing subdata selection methods assume that the big data can be reasonably modeled using an underlying model, such as a (multinomial) logistic regression for classification problems. These methods work extremely well when the underlying modeling assumption is correct but often yield poor results otherwise. In this paper, we propose a model-free subdata selection method for classification problems, and the resulting subdata is called PED subdata. The PED subdata uses decision trees to find a partition of the data, followed by selecting an appropriate sample from each component of the partition. Random forests are used for analyzing the selected subdata. Our method can be employed for a general number of classes in the response and for both categorical and continuous predictors. We show analytically that the PED subdata results in a smaller Gini than a uniform subdata. Further, we demonstrate that the PED subdata has higher classification accuracy than other competing methods through extensive simulated and real datasets.
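The partition-then-sample idea can be sketched in a few lines, assuming the partition is already available as one cell id per row (for example, the leaf a fitted decision tree assigns to each observation). The abstract does not say how many rows are drawn from each cell, so the fixed `per_cell` allocation below is an assumption, as are the function names `sample_per_cell` and `gini_impurity` and the toy data. The Gini helper computes the usual impurity 1 - Σ p_c² of a set of class labels.

```python
import random
from collections import Counter, defaultdict


def gini_impurity(labels):
    """Gini impurity 1 - sum_c p_c^2 of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def sample_per_cell(cell_ids, per_cell, seed=0):
    """Draw up to `per_cell` row indices from every cell of the partition.

    The equal allocation per cell is an assumed rule for illustration.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for row, cell in enumerate(cell_ids):
        cells[cell].append(row)

    chosen = []
    for cell in sorted(cells):
        members = cells[cell]
        chosen.extend(rng.sample(members, min(per_cell, len(members))))
    return sorted(chosen)


# Toy data: 12 rows in 3 cells; cell ids would come from a fitted tree.
cell_ids = [0] * 6 + [1] * 4 + [2] * 2
labels = ["a"] * 6 + ["b", "b", "b", "a"] + ["c"] * 2

subdata_rows = sample_per_cell(cell_ids, per_cell=2)
subdata_labels = [labels[i] for i in subdata_rows]
subdata_gini = gini_impurity(subdata_labels)
```

Sampling within cells guarantees that small cells, such as cell 2 here, are represented in the subdata, which a uniform sample of the same size does not guarantee.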

dataset, subdata, subdata selection method, (16 more...)

2404.19127

Country:

North America > United States > New York > Broome County > Binghamton (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > Experimental Study (0.48)
Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.49)

Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity

Joshi, Siddharth, Jain, Arnav, Payani, Ali, Mirzasoleiman, Baharan

arXiv.org Artificial Intelligence | Mar-19-2024

Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP's performance than increasing its volume. Nevertheless, finding small subsets of training data that provably generalize the best has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve a superior generalization performance. Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets found by our method achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions. Moreover, we show that our subsets obtain 1.5x the average accuracy of the next best baseline across 11 downstream datasets. The code is available at: https://github.com/BigML-CS-UCLA/clipcov-data-efficient-clip.
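As a rough illustration of what "preserving the cross-covariance" means, the sketch below measures how far a candidate subset's image-caption cross-covariance is from that of the full data. It assumes precomputed image and caption embeddings stored as NumPy arrays with one row per pair, uses a plain centered cross-covariance, and scores the gap with the Frobenius norm; the paper's exact definition and selection procedure may differ. The function names `cross_covariance` and `cross_cov_gap` and the random embeddings are hypothetical.

```python
import numpy as np


def cross_covariance(img, txt):
    """Centered cross-covariance between image and caption embeddings.

    img: (n, d_img) array, txt: (n, d_txt) array; returns (d_img, d_txt).
    """
    img_c = img - img.mean(axis=0)
    txt_c = txt - txt.mean(axis=0)
    return img_c.T @ txt_c / img.shape[0]


def cross_cov_gap(img, txt, subset):
    """Frobenius distance between the subset's and the full data's cross-covariance."""
    full = cross_covariance(img, txt)
    sub = cross_covariance(img[subset], txt[subset])
    return np.linalg.norm(full - sub, ord="fro")


# Random stand-in embeddings for illustration only.
rng = np.random.default_rng(0)
img = rng.normal(size=(200, 8))
txt = img[:, :4] + 0.1 * rng.normal(size=(200, 4))  # captions correlated with images

random_subset = rng.choice(200, size=50, replace=False)
gap_subset = cross_cov_gap(img, txt, random_subset)
gap_full = cross_cov_gap(img, txt, np.arange(200))  # full data: zero gap
```

A selection method in this spirit would search for a small subset with a low gap; the score above only evaluates a given subset and does not perform that search.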

latent class, representation, subset, (16 more...)

2403.12267

Country: